There is a large, diverse literature on creating indexes of what amounts to “Socioeconomic status” in spatial areas like neighborhoods. These indexes can then be used to investigate correlations between socioeconomics and health outcomes, infrastructure access, environmental exposures, or other variables, or target certain neighborhoods that seem particularly vulnerable for interventions. Many variables can be used to contribute to such an index. However, it is always difficult to choose which variables and how to weight them. For example, the USEPA EJSCREEN project tracks 6 demographic variables, but only uses two when constructing demographic indices to weight neighborhoods for an equity- weighting of environmental exposures: a simple average of % of population low-ioncome and % population “minority” (nonwhite alone/nonhispanic). One method that has been popular in social science and public health policy has been Principal Components Analysis (PCA). See this study for a foundational example. PCA creates linear combinations of sets of variables that explain the maximum amount of overall variance in the complete dataset. This amounts to weighting the variables that have the widest spread and that covary with other variables. In this way, any arbitrary combination of variables of interest to policymakers/planners/analysts can be chosen, and a linear combination of them constructed for use as a single or multiple indices in an automated manner. This approach has the added value of being customizable for different communities, where different axes of inequity may be more or less important or impactful than in other communities.
In the United States, typically this activity has been done to identify regions with socioeconomic correlations with cancer incidence. The typical source for variables has been the U.S Census American Community Survey, which includes some detailed demographic information about race/ethnicity, education levels, income, occupation, and housing quality, among other topics.
Two or more of the following variables are often used:
Some studies using them are listed below:
First, we take an example service area boundary. In this case the Newark Water Department.This comes from an external API but could be read from an arbitrary local or web location.
#pws <- read_sf("C:/Users/kyleo/Downloads/pws.gpkg")
#newark <- filter(pws,PWSID=="NJ0714001")
newark <- read_sf("https://geoconnex.us/ref/pws/NJ0714001")
mapview(newark)
Then we find the relevant Census Block Groups (orange) or Tracts (green), first downloading the entire state and then subsetting to the utility service area (blue).
bg <- get_acs(state = "NJ", geography = "block group", variables = "B19013_001", geometry = TRUE) %>% st_transform(4326)
## Getting data from the 2015-2019 5-year ACS
tr <- get_acs(state = "NJ", geography = "tract", variables = "B19013_001", geometry = TRUE) %>% st_transform(4326)
## Getting data from the 2015-2019 5-year ACS
bg1 <- st_intersects(bg,newark) %>% as.integer()
bg <- bg[which(bg1==1),]
tr1 <- st_intersects(tr,newark) %>% as.integer()
tr <- tr[which(tr1==1),]
mapview(newark, layer.name="Newark Water Department", map.types= "OpenStreetMap") +
mapview(bg, alpha.regions=0, color="orange", col.regions="orange",lwd=2, layer.name="Block Groups", map.types= "OpenStreetMap") +
mapview(tr, alpha.regions=0, color="green",col.regions="green", lwd=3, layer.name="Tracts", map.types= "OpenStreetMap")
Now we download the necessary census data and match them to the appropriate census boundaries.
vars <- tidycensus::load_variables(2019, "acs5", cache = TRUE)
cv <- c(pop_race_count = "B03002_001",
pop_nonhispanic_white_alone = "B03002_003",
pop_educ_attainment_count = "B15003_001",
pop_educ_none = "B15003_002", #0
pop_educ_nursery = "B15003_003", #0
pop_educ_kindergarten = "B15003_004", #0
pop_educ_grade1 = "B15003_005", #1
pop_educ_grade2 = "B15003_006", #2
pop_educ_grade3 = "B15003_007", #3
pop_educ_grade4 = "B15003_008", #4
pop_educ_grade5 = "B15003_009", #5
pop_educ_grade6 = "B15003_010", #6
pop_educ_grade7 = "B15003_011", #7
pop_educ_grade8 = "B15003_012", #8
pop_educ_grade9 = "B15003_013", #9
pop_educ_grade10 = "B15003_014", #10
pop_educ_grade11 = "B15003_015", #11
pop_educ_grade12_nodiploma = "B15003_016", #11.5
pop_educ_HSdiploma = "B15003_017", #12
pop_educ_GED = "B15003_018", #12
pop_educ_college_less_1year = "B15003_019", #13
pop_educ_college_more_1year_nodegree = "B15003_020", #13.5
pop_educ_assoc = "B15003_021", #14
pop_educ_bachelor = "B15003_022", #16
pop_educ_master = "B15003_023", #18
pop_educ_prof = "B15003_024", #19
pop_educ_doc = "B15003_025", # 21
poverty_level_total = "C17002_001",
poverty_level_gr_200FPL = "C17002_008" ,
hh_income_20ptile = "B19080_001",
hh_income_median = "B19013_001",
pop_hh_income_assistance_total = "B09010_001",
pop_in_hh_with_income_assistance = "B09010_002",
per_capita_income = "B19301_001",
pop_labor_force = "B23025_001",
pop_unemployed = "B23025_005",
housing_units_count = "B25002_001",
housing_units_vacant = "B25002_003",
housing_units_occupied = "B25002_002",
pop_units_count = "B25033_001",
pop_owner_occupied_units = "B25033_002",
pop_rental_units = "B25033_008",
housing_units_plumbing_count = "B25050_001",
housing_units_plumbing_complete = "B25050_002",
rent_median = "B25058_001",
median_gross_rent_percent_hh_income = "B25071_001",
home_value_median = "B25077_001",
hh_owner_occupied = "B25014_002",
hh_owner_occupied_1.5_2_perRoom = "B25014_006",
hh_owner_occupied_gr2.0_perRoom = "B25014_007",
hh_renters = "B25014_008",
hh_renters_1.5_2_perRoom = "B25014_012",
hh_renters_gr2.0_perRoom = "B25014_013",
hh_lang_total = "C16002_001",
hh_lang_ltd_api ="C16002_010",
hh_lang_ltd_otherio = "C16002_007",
hh_lang_ltd_spanish = "C16002_004",
hh_lang_ltd_other = "C16002_013",
pop_age_total = "B01001_001",
pop_age_male_65_66 = "B01001_020",
pop_age_male_67_69 = "B01001_021",
pop_age_male_70_74 = "B01001_022",
pop_age_male_75_79 = "B01001_023",
pop_age_male_80_84 = "B01001_024",
pop_age_male_85up = "B01001_025",
pop_age_female_65_66 = "B01001_044",
pop_age_female_67_69 = "B01001_045",
pop_age_female_70_74 = "B01001_046",
pop_age_female_75_79 = "B01001_047",
pop_age_female_80_84 = "B01001_048",
pop_age_female_85up = "B01001_049",
occupation_total = "C24010_001",
occupation_sales_office_male = "C24010_027",
occupation_sales_office_female = "C24010_063",
occupation_service_male = "C24010_019",
occupation_service_female = "C24010_055",
occupation_protective_service_male = "C24010_021",
occupation_protective_service_female = "C24010_057",
occupation_natresource_male = "C24010_030",
occupation_natresource_female = "C24010_066",
occupation_prodtrans_male = "C24010_034",
occupation_prodtrans_female = "C24010_070",
hh_vehicle_count = "B25044_001",
hh_owner_noVehicle = "B25044_003",
hh_renter_noVehicle = "B25044_010",
families_total = "B11003_001",
families_singleparent_female = "B11003_016",
families_singleparent_male = "B11003_010"
)
counties <- tidycensus::get_acs(year=2019,
geography = "county",
variables = "B01001_001",
geometry = TRUE) %>%
sf::st_transform(4326) %>%
dplyr::filter(lengths(sf::st_intersects(., newark)) > 0)
## Getting data from the 2015-2019 5-year ACS
# Find FIPS code for relevant state and counties, make sure no duplicates
st <- unique(substr(counties$GEOID,1,2))
ct <- unique(substr(counties$GEOID,3,5))
data.tr <- get_acs(year=2019,
geography = "tract",
variables = cv,
geometry = TRUE,
state = st,
county=ct,
output="wide") %>%
filter(GEOID %in% tr$GEOID)
## Getting data from the 2015-2019 5-year ACS
data.bg <- get_acs(year=2019,
geography = "block group",
variables = cv,
geometry = TRUE,
state = st,
county=ct,
output="wide") %>%
filter(GEOID %in% bg$GEOID)
## Getting data from the 2015-2019 5-year ACS
Now we construct the variables for our PCA at the Tract and Block Group level.
pcaData.bg <- data.bg %>%
mutate(percBelow200FPL = 100*(1-poverty_level_gr_200FPLE/poverty_level_totalE),
percEducLessHS = 100*(1-(pop_educ_HSdiplomaE +
pop_educ_GEDE +
pop_educ_college_less_1yearE +
pop_educ_college_more_1year_nodegreeE +
pop_educ_assocE +
pop_educ_masterE +
pop_educ_profE +
pop_educ_docE)/pop_educ_attainment_countE),
percMinority = 100*(1-pop_nonhispanic_white_aloneE/pop_race_countE),
income_median = hh_income_medianE,
income_20tile = hh_income_20ptileE,
income_percapita = per_capita_incomeE,
percAssistance = 100*pop_in_hh_with_income_assistanceE /pop_hh_income_assistance_totalE,
rentMedian = rent_medianE,
rentIncomeRatioMedian = median_gross_rent_percent_hh_incomeE,
homeValuemedian = home_value_medianE,
percHomeowners = 100*pop_owner_occupied_unitsE/pop_units_countE,
percBlueCollar = 100*(occupation_sales_office_maleE +
occupation_sales_office_femaleE +
occupation_service_maleE +
occupation_service_femaleE -
occupation_protective_service_maleE -
occupation_protective_service_femaleE +
occupation_natresource_maleE +
occupation_natresource_femaleE +
occupation_prodtrans_maleE +
occupation_prodtrans_femaleE)/occupation_totalE,
percUnemployed=pop_unemployedE/pop_labor_forceE,
percPlumbingComplete = housing_units_plumbing_completeE/housing_units_plumbing_countE,
percFamiliesSingleParent = 100*(families_singleparent_femaleE + families_singleparent_maleE)/families_totalE,
percHH_NoVehicle = 100*(hh_owner_noVehicleE+hh_renter_noVehicleE)/hh_vehicle_countE,
percAgeOver64 = 100*(pop_age_male_65_66E +
pop_age_male_67_69E +
pop_age_male_70_74E +
pop_age_male_75_79E +
pop_age_male_80_84E +
pop_age_male_85upE +
pop_age_female_65_66E +
pop_age_female_67_69E +
pop_age_female_70_74E +
pop_age_female_75_79E +
pop_age_female_80_84E +
pop_age_female_85upE)/pop_age_totalE,
percHH_LangIsolated = 100 * (hh_lang_ltd_apiE +
hh_lang_ltd_otherioE +
hh_lang_ltd_spanishE +
hh_lang_ltd_otherE)/hh_lang_totalE,
percHH_Crowded = 100 * (hh_owner_occupied_1.5_2_perRoomE +
hh_owner_occupied_gr2.0_perRoomE +
hh_renters_1.5_2_perRoomE +
hh_renters_gr2.0_perRoomE)/(hh_owner_occupiedE + hh_rentersE)) %>%
select(GEOID,percBelow200FPL:percHH_Crowded) %>%
st_drop_geometry()
pcaData.tr <- data.tr %>%
mutate(percBelow200FPL = 100*(1-poverty_level_gr_200FPLE/poverty_level_totalE),
percEducLessHS = 100*(1-(pop_educ_HSdiplomaE +
pop_educ_GEDE +
pop_educ_college_less_1yearE +
pop_educ_college_more_1year_nodegreeE +
pop_educ_assocE +
pop_educ_masterE +
pop_educ_profE +
pop_educ_docE)/pop_educ_attainment_countE),
percMinority = 100*(1-pop_nonhispanic_white_aloneE/pop_race_countE),
income_median = hh_income_medianE,
income_20tile = hh_income_20ptileE,
income_percapita = per_capita_incomeE,
percAssistance = 100*pop_in_hh_with_income_assistanceE /pop_hh_income_assistance_totalE,
rentMedian = rent_medianE,
rentIncomeRatioMedian = median_gross_rent_percent_hh_incomeE,
homeValuemedian = home_value_medianE,
percHomeowners = 100*pop_owner_occupied_unitsE/pop_units_countE,
percBlueCollar = 100*(occupation_sales_office_maleE +
occupation_sales_office_femaleE +
occupation_service_maleE +
occupation_service_femaleE -
occupation_protective_service_maleE -
occupation_protective_service_femaleE +
occupation_natresource_maleE +
occupation_natresource_femaleE +
occupation_prodtrans_maleE +
occupation_prodtrans_femaleE)/occupation_totalE,
percUnemployed=pop_unemployedE/pop_labor_forceE,
percPlumbingComplete = housing_units_plumbing_completeE/housing_units_plumbing_countE,
percFamiliesSingleParent = 100*(families_singleparent_femaleE + families_singleparent_maleE)/families_totalE,
percHH_NoVehicle = 100*(hh_owner_noVehicleE+hh_renter_noVehicleE)/hh_vehicle_countE,
percAgeOver64 = 100*(pop_age_male_65_66E +
pop_age_male_67_69E +
pop_age_male_70_74E +
pop_age_male_75_79E +
pop_age_male_80_84E +
pop_age_male_85upE +
pop_age_female_65_66E +
pop_age_female_67_69E +
pop_age_female_70_74E +
pop_age_female_75_79E +
pop_age_female_80_84E +
pop_age_female_85upE)/pop_age_totalE,
percHH_LangIsolated = 100 * (hh_lang_ltd_apiE +
hh_lang_ltd_otherioE +
hh_lang_ltd_spanishE +
hh_lang_ltd_otherE)/hh_lang_totalE,
percHH_Crowded = 100 * (hh_owner_occupied_1.5_2_perRoomE +
hh_owner_occupied_gr2.0_perRoomE +
hh_renters_1.5_2_perRoomE +
hh_renters_gr2.0_perRoomE)/(hh_owner_occupiedE + hh_rentersE)) %>%
select(GEOID,percBelow200FPL:percHH_Crowded) %>%
st_drop_geometry()
Now, we inspect the distributions of the variables to see which ones may be problematic. It appears that the following variables are highly skewed and should be transformed:
It also appears that the following variables are not available or otherwise highly missing at the block group level:
If clients are determined to use these variables, tract-level analysis may be necessary, which may not work for custom geographies that need to be assembled from smaller geographies than tracts.
Here we demonstrate how U.S. Census data can be re-estimated for arbitrary larger geographies. Below, take the example of the City of Newark’s Neighborhood boundaries, compared with Census Block Groups.
alt <- sf::read_sf("https://data.ci.newark.nj.us/dataset/bb616963-ea12-434e-96c1-526820517a66/resource/d59cbbb5-9762-4063-9c21-fc6fa3e1f787/download/newarkneighborhoods2013.geojson")
mapview(bg, layer.name= "Census Block Groups") + mapview(alt, alpha.regions=0, col.regions="darkslategrey", lwd=3, color="darkslategrey", layer.name="Alt. Polygons (Neighborhoods)") + mapview(newark, alpha.regions=0, lwd=2, color="black", col.regions="black",layer.name = "Newark Water Service Area")
In this case, Newarks alternative polygons seem to mostly be combinations of Census Block Groups, with minor discrepancies at borders. Other alternative polygons may be substantially different, so we implement a general weighted-area algorithm to reweight our census variables to the alternative polyogn
skim(pcaData.tr)
| Name | pcaData.tr |
| Number of rows | 130 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| GEOID | 0 | 1 | 11 | 11 | 0 | 130 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| percBelow200FPL | 2 | 0.98 | 47.47 | 15.23 | 4.80 | 38.74 | 48.73 | 57.90 | 81.23 | ▁▂▇▇▂ |
| percEducLessHS | 0 | 1.00 | 33.75 | 8.25 | 14.43 | 27.83 | 32.10 | 38.86 | 56.73 | ▁▇▇▃▁ |
| percMinority | 0 | 1.00 | 86.68 | 16.22 | 32.45 | 77.87 | 94.78 | 98.77 | 100.00 | ▁▁▂▂▇ |
| income_median | 2 | 0.98 | 45771.48 | 23216.00 | 15333.00 | 31557.50 | 41725.50 | 51843.50 | 181088.00 | ▇▃▁▁▁ |
| income_20tile | 2 | 0.98 | 18936.60 | 10874.67 | 4104.00 | 11358.50 | 15665.50 | 24163.00 | 78286.00 | ▇▅▁▁▁ |
| income_percapita | 1 | 0.99 | 23205.33 | 8746.13 | 6074.00 | 17037.00 | 22203.00 | 26696.00 | 67211.00 | ▆▇▂▁▁ |
| percAssistance | 2 | 0.98 | 39.43 | 19.57 | 0.00 | 25.49 | 40.07 | 51.06 | 91.27 | ▃▆▇▂▁ |
| rentMedian | 3 | 0.98 | 1001.54 | 242.39 | 228.00 | 906.00 | 986.00 | 1130.00 | 1937.00 | ▁▃▇▁▁ |
| rentIncomeRatioMedian | 2 | 0.98 | 35.44 | 7.22 | 20.80 | 29.98 | 34.25 | 39.50 | 51.00 | ▂▇▇▅▃ |
| homeValuemedian | 9 | 0.93 | 255654.54 | 77196.06 | 9999.00 | 208500.00 | 245100.00 | 304900.00 | 588000.00 | ▁▇▇▂▁ |
| percHomeowners | 2 | 0.98 | 31.93 | 16.96 | 0.00 | 20.84 | 29.20 | 40.24 | 91.87 | ▃▇▃▂▁ |
| percBlueCollar | 2 | 0.98 | 70.11 | 10.82 | 30.91 | 65.77 | 71.39 | 77.43 | 87.97 | ▁▁▂▇▅ |
| percUnemployed | 0 | 1.00 | 0.07 | 0.04 | 0.00 | 0.04 | 0.06 | 0.09 | 0.20 | ▅▇▅▂▁ |
| percPlumbingComplete | 2 | 0.98 | 0.99 | 0.01 | 0.95 | 0.99 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| percFamiliesSingleParent | 2 | 0.98 | 27.78 | 13.58 | 3.22 | 17.46 | 26.67 | 37.94 | 56.80 | ▆▇▇▆▃ |
| percHH_NoVehicle | 2 | 0.98 | 32.41 | 14.84 | 2.29 | 21.46 | 32.51 | 42.67 | 70.86 | ▃▆▇▅▁ |
| percAgeOver64 | 0 | 1.00 | 11.14 | 4.47 | 0.43 | 8.06 | 10.38 | 14.33 | 24.90 | ▂▇▆▃▁ |
| percHH_LangIsolated | 2 | 0.98 | 14.35 | 12.86 | 0.00 | 4.73 | 8.90 | 23.47 | 52.43 | ▇▂▂▁▁ |
| percHH_Crowded | 2 | 0.98 | 2.61 | 2.38 | 0.00 | 0.64 | 2.16 | 4.06 | 9.29 | ▇▅▂▂▁ |
skim(pcaData.bg)
| Name | pcaData.bg |
| Number of rows | 274 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 19 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| GEOID | 0 | 1 | 12 | 12 | 0 | 274 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| percBelow200FPL | 5 | 0.98 | 48.52 | 19.51 | 0.00 | 37.30 | 49.48 | 60.00 | 100.00 | ▂▅▇▃▁ |
| percEducLessHS | 1 | 1.00 | 33.67 | 11.92 | 0.00 | 25.94 | 32.89 | 40.95 | 69.86 | ▁▅▇▃▁ |
| percMinority | 1 | 1.00 | 88.77 | 15.88 | 31.66 | 85.51 | 96.22 | 99.84 | 100.00 | ▁▁▁▁▇ |
| income_median | 30 | 0.89 | 45129.20 | 22911.84 | 9389.00 | 30984.50 | 41250.00 | 53962.75 | 147500.00 | ▇▇▂▁▁ |
| income_20tile | 274 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| income_percapita | 4 | 0.99 | 22471.93 | 9377.28 | 2087.00 | 16424.75 | 20430.50 | 26568.50 | 60851.00 | ▂▇▃▁▁ |
| percAssistance | 274 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| rentMedian | 20 | 0.93 | 987.73 | 282.91 | 209.00 | 889.00 | 978.00 | 1124.00 | 2264.00 | ▂▇▆▁▁ |
| rentIncomeRatioMedian | 12 | 0.96 | 37.21 | 9.04 | 18.30 | 29.90 | 35.90 | 44.25 | 51.00 | ▂▆▇▅▇ |
| homeValuemedian | 65 | 0.76 | 260973.67 | 93729.69 | 9999.00 | 197200.00 | 243800.00 | 309800.00 | 818200.00 | ▂▇▂▁▁ |
| percHomeowners | 5 | 0.98 | 31.23 | 21.22 | 0.00 | 16.09 | 26.21 | 41.61 | 100.00 | ▇▇▃▂▁ |
| percBlueCollar | 6 | 0.98 | 70.36 | 14.82 | 20.45 | 63.13 | 72.91 | 80.60 | 100.00 | ▁▂▅▇▃ |
| percUnemployed | 1 | 1.00 | 0.08 | 0.06 | 0.00 | 0.04 | 0.07 | 0.10 | 0.60 | ▇▂▁▁▁ |
| percPlumbingComplete | 5 | 0.98 | 0.99 | 0.02 | 0.84 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
| percFamiliesSingleParent | 5 | 0.98 | 28.44 | 19.14 | 0.00 | 14.59 | 26.53 | 40.48 | 100.00 | ▇▇▅▁▁ |
| percHH_NoVehicle | 5 | 0.98 | 32.55 | 18.14 | 0.00 | 20.73 | 30.89 | 43.27 | 85.71 | ▅▇▆▃▁ |
| percAgeOver64 | 1 | 1.00 | 11.30 | 6.72 | 0.00 | 6.63 | 10.53 | 14.98 | 41.86 | ▆▇▂▁▁ |
| percHH_LangIsolated | 5 | 0.98 | 14.27 | 15.46 | 0.00 | 1.13 | 9.30 | 22.72 | 81.03 | ▇▃▁▁▁ |
| percHH_Crowded | 5 | 0.98 | 2.51 | 3.94 | 0.00 | 0.00 | 0.00 | 3.77 | 31.08 | ▇▁▁▁▁ |
With this knowledge, we conduct the PCA, first on all the available variables for Block Groups and for Tracts. Below, we report on the first three principal components of the PCA for the tracts and block groups.
is.nan.data.frame <- function(x)
do.call(cbind, lapply(x, is.nan))
pcaData.bg[is.nan(pcaData.bg)]<-NA
x=select(pcaData.bg,-GEOID,-percAssistance, -income_20tile)
pca.bg <- prcomp(x=na.omit(x), scale=TRUE, center=TRUE, na.action=na.omit, rank=3)
y<-predict(pca.bg,pcaData.bg)
pcaData.bg$PC1_allVars <- y[,1]
pcaData.tr[is.nan(pcaData.tr)]<-NA
x=select(pcaData.tr,-GEOID)
pca.tr <- prcomp(x=na.omit(x), scale=TRUE, center=TRUE, na.action=na.omit, rank=3)
y<-predict(pca.tr,pcaData.tr)
pcaData.tr$PC1_allVars <- y[,1]
pcaData.tr$PC1 <- y[,1]
pcaData.tr$PC2 <- y[,2]
pcaData.tr$PC3 <- y[,2]
For tracts, we see that the first component alone explains 42% of the variance in the specified variables. Higher values on this component are associated with lower income measures, higher poverty, public assistance and unemployment, lower rents and housing values, more blue collar employment, higher proportions of minority groups, lower homeownership rates, less vehicle ownership, and higher prevalence of single-parent families, without being particularly informed by linguistic isolation, education levels, or housing crowdedness.
kable(summary(pca.tr)$importance[1:3,1:3],digits=2)
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| Standard deviation | 2.81 | 1.72 | 1.16 |
| Proportion of Variance | 0.42 | 0.16 | 0.07 |
| Cumulative Proportion | 0.42 | 0.57 | 0.64 |
kable(pca.tr$rotation,digits=2)
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| percBelow200FPL | 0.33 | -0.09 | 0.02 |
| percEducLessHS | -0.07 | -0.46 | 0.18 |
| percMinority | 0.25 | 0.27 | -0.13 |
| income_median | -0.33 | 0.04 | -0.06 |
| income_20tile | -0.31 | 0.03 | -0.05 |
| income_percapita | -0.32 | 0.06 | 0.05 |
| percAssistance | 0.26 | 0.02 | 0.13 |
| rentMedian | -0.24 | -0.06 | -0.41 |
| rentIncomeRatioMedian | 0.18 | 0.03 | -0.32 |
| homeValuemedian | -0.24 | -0.24 | 0.03 |
| percHomeowners | -0.27 | 0.21 | -0.09 |
| percBlueCollar | 0.25 | -0.21 | -0.06 |
| percUnemployed | 0.13 | 0.31 | -0.20 |
| percPlumbingComplete | 0.00 | 0.17 | 0.40 |
| percFamiliesSingleParent | 0.26 | 0.14 | -0.11 |
| percHH_NoVehicle | 0.27 | -0.17 | 0.24 |
| percAgeOver64 | -0.10 | 0.12 | 0.54 |
| percHH_LangIsolated | 0.01 | -0.52 | -0.03 |
| percHH_Crowded | 0.04 | -0.30 | -0.29 |
For block groups, we see that the first component alone explains only 29% of the variance in the specified variables (compared to 42% in the tract-level analysis). However, the way the variables contribute to the first component is similar to how they do in the tract-level analysis.
kable(summary(pca.bg)$importance[1:3,1:3],digits=2)
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| Standard deviation | 2.21 | 1.70 | 1.18 |
| Proportion of Variance | 0.29 | 0.17 | 0.08 |
| Cumulative Proportion | 0.29 | 0.46 | 0.54 |
kable(pca.bg$rotation,digits=2)
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| percBelow200FPL | -0.40 | 0.08 | -0.06 |
| percEducLessHS | 0.07 | 0.42 | 0.03 |
| percMinority | -0.23 | -0.35 | -0.04 |
| income_median | 0.40 | -0.05 | -0.15 |
| income_percapita | 0.38 | -0.05 | -0.03 |
| rentMedian | 0.30 | 0.03 | -0.29 |
| rentIncomeRatioMedian | -0.14 | 0.04 | -0.26 |
| homeValuemedian | 0.19 | 0.34 | 0.09 |
| percHomeowners | 0.28 | -0.21 | -0.03 |
| percBlueCollar | -0.22 | 0.22 | 0.28 |
| percUnemployed | -0.17 | -0.29 | -0.03 |
| percPlumbingComplete | 0.01 | -0.13 | 0.47 |
| percFamiliesSingleParent | -0.29 | -0.12 | -0.28 |
| percHH_NoVehicle | -0.28 | 0.21 | 0.11 |
| percAgeOver64 | 0.12 | -0.09 | 0.49 |
| percHH_LangIsolated | -0.01 | 0.51 | 0.04 |
| percHH_Crowded | -0.04 | 0.24 | -0.42 |
Below, we visualize how the first component (PC1) can be used as a multidimensional socioeconomic status/ deprivation index, that incorporates more information and yields different results than just using income or poverty levels or minority group prevalence. Tracts have less missing data than block groups, generally due to U.S. Census Bureau privacy controls that restrict reporting of certain cross-tabulations in small areas. Thus, the smaller the area, the fewer variables are suitable for use in constructing socioeconomic indexes.
bg <- left_join(bg,pcaData.bg,by="GEOID")
tr <- left_join(tr,pcaData.tr,by="GEOID")
tr$PC1_allVars <- -tr$PC1_allVars
m.pc.bg <- mapview(bg,zcol="PC1_allVars", layer.name="Block Groups \n PC1") + mapview(newark,alpha.regions=0,col.regions="red",color="red",layer.name="Newark Water Dept.", lwd=2)
m.pov.bg <- mapview(bg,zcol="income_median", layer.name="Block Groups \n Median HH Income") + mapview(newark,alpha.regions=0,col.regions="red",color="red",layer.name="Newark Water Dept.", lwd=2)
m.pc.tr <- mapview(tr,zcol="PC1_allVars", layer.name="Tracts \n PC1") + mapview(newark,alpha.regions=0,col.regions="red",color="red",layer.name="Newark Water Dept.", lwd=2)
m.pov.tr <- mapview(tr,zcol="income_median", layer.name="Tracts \n Median HH Income") + mapview(newark,alpha.regions=0,col.regions="red",color="red",layer.name="Newark Water Dept.", lwd=2)
mapl <- sync(m.pc.bg, m.pov.bg, m.pc.tr, m.pov.tr)
mapl
Below we demonstrate how using an index can be used to highlight neighborhoods that users may be more interested in from an equity framework. For example, if the distribution of a problematic phenomenon such as taste complaints is randomly distributed, then from an equity perspective, one may wish to prioritize addressing these issues in the most socioeconomically burdened neighborhoods before less burdened neighborhoods.
We simulate tract-level taste complaints per household as a normally distributed random variable with a mean of 0.2 and a standard deviation of 0.1 and a minimum of 0, with no assumed correlation with any equity-related variable:
set.seed(2354367)
tr$taste_complaints_perHH <- rnorm(n=130,mean=0.2, sd=0.1)
tr$taste_complaints_perHH[which(tr$taste_complaints_perHH<0)] <- 0
tr$PC1_allVars <- -tr$PC1_allVars
map.tr <- mapview::mapview(tr,zcol="taste_complaints_perHH", layer.name="Tracts \n Taste Compl per HH.")
map.pc <- mapview::mapview(tr,zcol="PC1_allVars", layer.name="Tracts \n Burden Index")
sync(map.pc, map.tr)
ggplot2::ggplot(data=tr, aes(x=-PC1_allVars,y=taste_complaints_perHH)) + geom_point() + geom_abline() + xlab("Equity Principal Component (higher = more burdened)")
## Warning: Removed 9 rows containing missing values (geom_point).
As we can see above, there is no correlation between our simulated taste complaint indicator and the socioeconomic status index (scale reversed, so that higher numbers correspond to a more “burdened” area).
We can multiply the taste complaint indicator by the “burden” index to create a taste complaint scaled by burden that will highlight different areas to focus on than by focusing on the indicator alone:
tr$taste_scaled <- tr$PC1_allVars* tr$taste_complaints_perHH
map.tr <- mapview::mapview(tr,zcol="taste_complaints_perHH", layer.name="Tracts \n Taste Compl per HH.")
map.pc <- mapview::mapview(tr,zcol="taste_scaled", layer.name="Taste complaints scaled by burden")
sync(map.pc, map.tr)